In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine.
We’ll build two Poisson regressions, two negative binomial regressions, and two multiple linear regression models.
In order to explore summary stats and distribution characteristics of our dataset, we’ll need to first conduct some basic transformations and cleanup:
- target, a numeric variable indicating the number of cases purchased.
- … target, suggesting this data might be used for prediction rather than validation and evaluation of model performance. For clarity we’ll rename this dataset ‘prediction’ instead and create a separate validation hold-out from the training data.
- … index column labeling the observations, which can be excluded from the models.

| variable | complete_rate | n_missing | min | max |
|---|---|---|---|---|
| acidindex | 1.00 | 0 | 4.00 | 17.00 |
| alcohol | 0.95 | 838 | -4.70 | 26.50 |
| chlorides | 0.95 | 776 | -1.17 | 1.35 |
| citricacid | 1.00 | 0 | -3.24 | 3.86 |
| density | 1.00 | 0 | 0.89 | 1.10 |
| fixedacidity | 1.00 | 0 | -18.20 | 34.40 |
| freesulfurdioxide | 0.95 | 799 | -563.00 | 623.00 |
| labelappeal | 1.00 | 0 | -2.00 | 2.00 |
| ph | 0.97 | 499 | 0.48 | 6.21 |
| residualsugar | 0.95 | 784 | -128.30 | 145.40 |
| stars | 0.74 | 4200 | 1.00 | 4.00 |
| sulphates | 0.91 | 1520 | -3.13 | 4.24 |
| totalsulfurdioxide | 0.95 | 839 | -823.00 | 1057.00 |
| volatileacidity | 1.00 | 0 | -2.83 | 3.68 |
- AcidIndex: Proprietary method of testing total acidity of wine by using a weighted average
- Alcohol: Alcohol Content
- Chlorides: Chloride content of wine
- CitricAcid: Citric Acid Content
- Density: Density of Wine
- FixedAcidity: Fixed Acidity of Wine
- FreeSulfurDioxide: Sulfur Dioxide content of wine
- LabelAppeal: Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customers don’t like the design.
- ResidualSugar: Residual Sugar of wine
- Stars: Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor. A high number of stars suggests high sales
- Sulphates: Sulfate content of wine
- TotalSulfurDioxide: Total Sulfur Dioxide of Wine
- VolatileAcidity: Volatile Acid content of wine
- pH: pH of wine

One of the first characteristics that stand out is the presence of negative values for many chemical compounds, and the relative normality of their distributions. This suggests they have already been power-transformed to produce normal distributions for modeling.
Variables related to sugars, chlorides, acidity, sulfites and sulfates all seem to fall in this category. Considering that we are analyzing very tiny amounts of chemical compounds, we might expect their natural distributions to be highly skewed.
We tried exponentiating these variables using base e and other bases, but did not arrive at an obvious or consistent transformation approach, so we may not be able to interpret model results on the scale of the original values for these variables.
Next we’ll find and impute any missing data. There are 8 predictor variables that contain NAs:
| variable | is_na | pct |
|---|---|---|
| stars | 4200 | 0.26 |
| sulphates | 1520 | 0.09 |
| residualsugar | 784 | 0.05 |
| chlorides | 776 | 0.05 |
| freesulfurdioxide | 799 | 0.05 |
| totalsulfurdioxide | 839 | 0.05 |
| alcohol | 838 | 0.05 |
| ph | 499 | 0.03 |
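A table like the one above can be produced with a few lines of base R. This is a minimal sketch on a made-up toy data frame (the `wine` object and its values are assumptions for illustration, not the assignment data):

```r
# Toy data frame standing in for the wine data (values are made up).
wine <- data.frame(
  stars     = c(1, NA, 3, NA, 2, 4),
  sulphates = c(0.5, 0.7, NA, 0.4, 0.6, 0.9),
  density   = c(0.99, 1.00, 0.98, 0.99, 1.01, 0.99)
)

# Count NAs per column and express them as a share of all rows.
na_count <- colSums(is.na(wine))
na_table <- data.frame(
  variable = names(na_count),
  is_na    = as.integer(na_count),
  pct      = round(as.numeric(na_count) / nrow(wine), 2)
)

# Keep only columns with at least one NA, largest count first.
na_table <- na_table[na_table$is_na > 0, ]
na_table <- na_table[order(-na_table$is_na), ]
print(na_table)
```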
Heeding the warning in the assignment, “sometimes, the fact that a variable is missing is actually predictive of the target”, we’ll consider each of these variables carefully. While there may be data “missing completely at random” (MCAR) that we wish to impute, this may not always be the case.
The predictor stars suggests that out of roughly 16,000 wine
samples, about 26% have never been professionally reviewed. If we assume
that the existence of a review has some impact on the sales of a wine
brand (whatever the reviewer’s sentiment), then imputing mean or
predicted values here might distort our model.
To enable further analysis we’ll convert stars from a
numeric to a factor, with a level ‘0’ representing our NA values.
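The recoding described above could look roughly like the following. It is a sketch on a short made-up vector; the object names are hypothetical, and the level order (with ‘0’ first) is an assumption chosen so that unrated wines become the reference level in later models.

```r
# Made-up ratings; NA means the wine was never reviewed.
stars <- c(1, NA, 3, 2, NA, 4)

# Replace NA with "0", then build a factor with "0" as the first (reference) level.
stars_chr <- ifelse(is.na(stars), "0", as.character(stars))
stars_f <- factor(stars_chr, levels = c("0", "1", "2", "3", "4"))

print(table(stars_f))
```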
Next we consider some of the missing chemical compounds in our wines:
alcohol, sugars, chlorides, sulfites and sulfates, and measures such as
ph.
First, we can safely assume that all wines in this dataset have an
actual ph score greater than zero (a ph of zero would represent
the most acidic rank, such as powerful industrial acids). We’ll want to
impute more reasonable values for these.
Based on some reading into the organic wines segment, there is a growing demand in the market for specialty products such as low-sulfite, low-sugar and low-alcohol wines. However, this still represents a very small segment of the overall market, and chemically it’s not likely for these compounds to be completely absent from the final product.
Additionally, the predictors freesulfurdioxide and
totalsulfurdioxide are linked: the amount of ‘Free’ SO2 in
wine is always a subset of the ‘Total’ SO2 present. We only identified
59 cases where both these values were NA, while over 1500 cases had
missing values for only one or the other.
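The co-missingness counts above can be checked with logical vectors. The sketch below uses a small made-up data frame (`so2` and its values are assumptions). The final sanity check, that free SO2 does not exceed total SO2, only makes sense on untransformed measurements, so it may not hold on the pre-transformed values in this dataset.

```r
# Made-up SO2 measurements with some missing values.
so2 <- data.frame(
  freesulfurdioxide  = c(30, NA, 12, NA, 45, 20),
  totalsulfurdioxide = c(120, 90, NA, NA, 150, NA)
)

free_na  <- is.na(so2$freesulfurdioxide)
total_na <- is.na(so2$totalsulfurdioxide)

both_missing     <- sum(free_na & total_na)      # rows missing both values
only_one_missing <- sum(xor(free_na, total_na))  # rows missing exactly one

# On complete rows, count cases where free SO2 exceeds total SO2.
complete   <- !free_na & !total_na
violations <- sum(so2$freesulfurdioxide[complete] >
                  so2$totalsulfurdioxide[complete])

print(c(both = both_missing, only_one = only_one_missing, violations = violations))
```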
Based on these observations, we’ll use the MICE imputation method to
predict and impute the missing values for residualsugar,
chlorides, freesulfurdioxide,
totalsulfurdioxide, sulphates,
alcohol and ph.
Target/source labels and non-chemical predictors
labelappeal and stars were excluded as
predictors for the imputation.
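One way to exclude columns as predictors is through the predictor matrix of the mice package. The sketch below runs on simulated data and only if mice is installed. The number of imputations (`m = 5`), the default imputation method, the seed, and the toy column values are all assumptions; the report does not state the settings actually used.

```r
set.seed(42)
n <- 200

# Simulated stand-in for the wine data (values are made up).
toy <- data.frame(
  ph          = rnorm(n, 3.2, 0.3),
  alcohol     = rnorm(n, 10.5, 2),
  sulphates   = rnorm(n, 0.5, 0.4),
  labelappeal = sample(-2:2, n, replace = TRUE),
  stars       = factor(sample(0:4, n, replace = TRUE))
)
toy$ph[sample(n, 10)]      <- NA
toy$alcohol[sample(n, 15)] <- NA
n_missing_before <- sum(is.na(toy))

if (requireNamespace("mice", quietly = TRUE)) {
  suppressPackageStartupMessages(library(mice))

  # Rows of the predictor matrix are imputation targets, columns are predictors.
  # Zeroing a column stops that variable from being used as a predictor.
  pred <- make.predictorMatrix(toy)
  pred[, c("labelappeal", "stars")] <- 0

  imp <- mice(toy, m = 5, predictorMatrix = pred, seed = 42, printFlag = FALSE)
  completed <- mice::complete(imp, 1)  # first of the m completed datasets
  print(colSums(is.na(completed)))
}
```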
labelappeal is a numeric score of consumer ratings for a
wine brand’s label design. It has also been pre-transformed to produce a
normal distribution for modeling; however this is a very sparse variable
with nearly half the cases having a value of zero.
This may be a candidate for handling with Zero-Inflated models. We
won’t change the values here, but will convert labelappeal
from a numeric to a factor.
We now have reasonably imputed values, and nearly-normal
distributions for our numeric predictors, taking special note of the
frequency of zero values for labelappeal and
stars.
| variable | n_missing | n_zero |
|---|---|---|
| acidindex | 0 | 0 |
| alcohol | 0 | 5 |
| chlorides | 0 | 8 |
| citricacid | 0 | 151 |
| density | 0 | 0 |
| fixedacidity | 0 | 47 |
| freesulfurdioxide | 0 | 12 |
| labelappeal | 0 | 7087 |
| ph | 0 | 0 |
| residualsugar | 0 | 6 |
| stars | 0 | 4200 |
| sulphates | 0 | 29 |
| totalsulfurdioxide | 0 | 12 |
| volatileacidity | 0 | 22 |
With transformations complete, we split back into training and
prediction datasets based on our source_flag, and create a
15% validation hold-out from the training data.
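A split of this kind can be sketched in base R as follows. The data are simulated, and the `source_flag` values (`"train"` and `"prediction"`), the seed, and the object names other than df_train are assumptions made for illustration.

```r
set.seed(123)

# Simulated combined dataset: 80 labeled rows and 20 rows to predict.
df_all <- data.frame(
  target      = rpois(100, 3),
  alcohol     = rnorm(100, 10, 2),
  source_flag = rep(c("train", "prediction"), times = c(80, 20))
)

# Split back into labeled and prediction data using the flag.
df_labeled    <- df_all[df_all$source_flag == "train", ]
df_prediction <- df_all[df_all$source_flag == "prediction", ]

# Hold out 15% of the labeled rows for validation.
n_valid   <- round(0.15 * nrow(df_labeled))
valid_idx <- sample(seq_len(nrow(df_labeled)), size = n_valid)
df_valid  <- df_labeled[valid_idx, ]
df_train  <- df_labeled[-valid_idx, ]

print(c(train = nrow(df_train), valid = nrow(df_valid), prediction = nrow(df_prediction)))
```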
Poisson Regression assumes that the conditional variance and mean of our
dependent variable target are roughly equal; otherwise we
may be looking at over- or under-dispersion.
pr1 <- glm(target ~ ., family = 'poisson', data = df_train)
##
## Call:
## glm(formula = target ~ ., family = "poisson", data = df_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.607e-01 2.164e-01 3.053 0.002268 **
## fixedacidity -3.809e-04 8.856e-04 -0.430 0.667148
## volatileacidity -3.387e-02 7.093e-03 -4.776 1.79e-06 ***
## citricacid 5.441e-03 6.408e-03 0.849 0.395807
## residualsugar 7.920e-05 1.632e-04 0.485 0.627481
## chlorides -3.139e-02 1.735e-02 -1.810 0.070361 .
## freesulfurdioxide 8.966e-05 3.701e-05 2.422 0.015418 *
## totalsulfurdioxide 8.205e-05 2.422e-05 3.388 0.000704 ***
## density -2.293e-01 2.084e-01 -1.101 0.271049
## ph -1.128e-02 8.190e-03 -1.377 0.168366
## sulphates -6.587e-03 5.906e-03 -1.115 0.264669
## alcohol 3.583e-03 1.500e-03 2.388 0.016928 *
## labelappeal-1 2.206e-01 4.128e-02 5.345 9.05e-08 ***
## labelappeal0 4.158e-01 4.021e-02 10.341 < 2e-16 ***
## labelappeal1 5.450e-01 4.090e-02 13.325 < 2e-16 ***
## labelappeal2 6.927e-01 4.609e-02 15.030 < 2e-16 ***
## acidindex -7.994e-02 4.996e-03 -16.001 < 2e-16 ***
## stars1 7.850e-01 2.123e-02 36.981 < 2e-16 ***
## stars2 1.096e+00 1.984e-02 55.225 < 2e-16 ***
## stars3 1.215e+00 2.088e-02 58.193 < 2e-16 ***
## stars4 1.330e+00 2.633e-02 50.523 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 19379 on 10831 degrees of freedom
## Residual deviance: 11477 on 10811 degrees of freedom
## AIC: 38540
##
## Number of Fisher Scoring iterations: 6
| metric | value |
|---|---|
| AIC | 38539.67 |
| Dispersion | 0.88 |
| Log-Lik | -19248.84 |
We note that our model has generated ‘dummies’ from our categorical
variables labelappeal and stars, and of the 20
total predictors, all but seven (fixedacidity, citricacid, residualsugar, chlorides, density, ph and sulphates) are statistically significant at the 0.05 level.
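Because the Poisson model uses a log link, each coefficient is easier to read after exponentiation, as a multiplicative effect on the expected count. The sketch below copies a few estimates by hand from the summary output above; it assumes the reference level of stars is ‘0’ (no rating), which is what the factor conversion described earlier implies.

```r
# Selected estimates copied from the Poisson summary output.
coefs <- c(stars1 = 0.7850, stars2 = 1.096, stars3 = 1.215,
           stars4 = 1.330, acidindex = -0.07994)

# exp(beta) is the multiplicative change in the expected count.
rate_ratios <- exp(coefs)
print(round(rate_ratios, 2))
```

For example, holding the other predictors fixed, a one-star wine is expected to sell a little over twice as many cases as an unrated one, while each extra unit of acidindex lowers the expected count by roughly 8%.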
Notably, our Dispersion Parameter is 0.88, which suggests a degree of under-dispersion in the data.
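A common way to estimate dispersion for a Poisson GLM is the sum of squared Pearson residuals divided by the residual degrees of freedom. Whether the value reported above was computed this way is an assumption; the sketch below applies it to simulated, truly Poisson data, where the statistic should land near 1.

```r
set.seed(1)

# Simulated Poisson data with one predictor.
x <- rnorm(500)
y <- rpois(500, lambda = exp(0.5 + 0.3 * x))
fit <- glm(y ~ x, family = poisson)

# Pearson dispersion statistic: values well below 1 suggest under-dispersion,
# values well above 1 suggest over-dispersion.
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
print(dispersion)
```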
By graphing our target values (green) against our predicted values (blue) we can easily see this model tends to under-predict the higher count levels, and wildly over-predict the lower count levels.
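We’ll build a Zero-Inflated Poisson model to see if we can improve model
accuracy. Our motivation is the large number of zero values in our labelappeal and stars
predictors; strictly, the zero-inflation component models excess zeros in the response target, with these predictors informing the probability of a zero.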
pr2 <- zeroinfl(target ~ . | ., data=df_train, dist = 'poisson')
##
## Call:
## zeroinfl(formula = target ~ . | ., data = df_train, dist = "poisson")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.275897 -0.428242 -0.001392 0.382528 5.195474
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.205e-01 2.238e-01 3.219 0.001288 **
## fixedacidity 1.885e-04 9.089e-04 0.207 0.835675
## volatileacidity -1.190e-02 7.304e-03 -1.629 0.103221
## citricacid 2.826e-03 6.529e-03 0.433 0.665133
## residualsugar -6.728e-05 1.671e-04 -0.403 0.687225
## chlorides -2.201e-02 1.782e-02 -1.235 0.216706
## freesulfurdioxide 1.554e-05 3.723e-05 0.417 0.676474
## totalsulfurdioxide -1.600e-05 2.418e-05 -0.662 0.508230
## density -2.167e-01 2.149e-01 -1.008 0.313227
## ph 4.417e-03 8.386e-03 0.527 0.598379
## sulphates 1.815e-03 6.061e-03 0.299 0.764636
## alcohol 6.170e-03 1.537e-03 4.016 5.93e-05 ***
## labelappeal-1 4.351e-01 4.489e-02 9.694 < 2e-16 ***
## labelappeal0 7.245e-01 4.383e-02 16.532 < 2e-16 ***
## labelappeal1 9.131e-01 4.456e-02 20.491 < 2e-16 ***
## labelappeal2 1.071e+00 4.947e-02 21.647 < 2e-16 ***
## acidindex -1.931e-02 5.336e-03 -3.619 0.000295 ***
## stars1 6.819e-02 2.299e-02 2.967 0.003010 **
## stars2 1.888e-01 2.152e-02 8.772 < 2e-16 ***
## stars3 2.870e-01 2.252e-02 12.742 < 2e-16 ***
## stars4 3.833e-01 2.777e-02 13.804 < 2e-16 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.808e+00 1.517e+00 -4.486 7.24e-06 ***
## fixedacidity 3.174e-03 5.903e-03 0.538 0.590865
## volatileacidity 2.159e-01 4.783e-02 4.513 6.38e-06 ***
## citricacid -1.453e-02 4.351e-02 -0.334 0.738408
## residualsugar -1.410e-03 1.096e-03 -1.286 0.198283
## chlorides 2.097e-02 1.168e-01 0.180 0.857473
## freesulfurdioxide -8.157e-04 2.531e-04 -3.223 0.001270 **
## totalsulfurdioxide -1.010e-03 1.629e-04 -6.201 5.60e-10 ***
## density 6.806e-01 1.425e+00 0.478 0.632943
## ph 1.955e-01 5.412e-02 3.613 0.000303 ***
## sulphates 8.649e-02 3.953e-02 2.188 0.028671 *
## alcohol 1.860e-02 1.009e-02 1.844 0.065253 .
## labelappeal-1 1.646e+00 3.857e-01 4.266 1.99e-05 ***
## labelappeal0 2.365e+00 3.833e-01 6.169 6.87e-10 ***
## labelappeal1 3.088e+00 3.890e-01 7.938 2.06e-15 ***
## labelappeal2 3.419e+00 4.426e-01 7.725 1.12e-14 ***
## acidindex 4.290e-01 2.869e-02 14.952 < 2e-16 ***
## stars1 -2.135e+00 8.357e-02 -25.548 < 2e-16 ***
## stars2 -5.679e+00 3.388e-01 -16.764 < 2e-16 ***
## stars3 -2.025e+01 3.693e+02 -0.055 0.956273
## stars4 -2.036e+01 6.933e+02 -0.029 0.976572
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 47
## Log-likelihood: -1.719e+04 on 42 Df
| metric | value |
|---|---|
| AIC | 34471.98 |
| Dispersion | 0.45 |
| Log-Lik | -17193.99 |
Using a Zero-Inflated model, the Dispersion Parameter drops to 0.45, but we are getting a better overall result for counts of 3 or more. By graphing our target values (green) against our predicted values (blue) we can see we are getting much greater accuracy for most of the mid- and upper counts.
Notably, we are still under-predicting counts of 1-2, and greatly over-predicting counts of zero.
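A quick numeric check for excess zeros is to compare the observed number of zero counts with the number a fitted Poisson model expects. This is a sketch on simulated data with a deliberately added 25% share of structural zeros (an assumption for illustration); it does not use the wine data or the zeroinfl fit.

```r
set.seed(7)
n <- 1000
x <- rnorm(n)

# Mix structural zeros into an otherwise Poisson response.
structural_zero <- rbinom(n, 1, 0.25) == 1
y <- ifelse(structural_zero, 0L, rpois(n, exp(1 + 0.2 * x)))

# A plain Poisson fit ignores the extra zero-generating process.
fit <- glm(y ~ x, family = poisson)
mu  <- fitted(fit)

observed_zeros <- sum(y == 0)
expected_zeros <- sum(dpois(0, mu))  # sum of P(Y = 0) under the fitted model

print(c(observed = observed_zeros, expected = round(expected_zeros, 1)))
```

When the observed count of zeros is far above the expected count, a zero-inflated specification is worth trying.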
Generally, we would use Negative Binomial Regression in cases of over-dispersion (where the variance of our dependent variable is significantly greater than the mean). This does not appear to be the case with our dataset, but we’ll apply it here and examine the results:
nb1 <- glm.nb(target ~ ., data = df_train)
##
## Call:
## glm.nb(formula = target ~ ., data = df_train, init.theta = 40652.27005,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.607e-01 2.164e-01 3.053 0.002268 **
## fixedacidity -3.809e-04 8.857e-04 -0.430 0.667150
## volatileacidity -3.388e-02 7.094e-03 -4.776 1.79e-06 ***
## citricacid 5.441e-03 6.408e-03 0.849 0.395823
## residualsugar 7.921e-05 1.632e-04 0.485 0.627468
## chlorides -3.139e-02 1.735e-02 -1.810 0.070366 .
## freesulfurdioxide 8.967e-05 3.702e-05 2.422 0.015419 *
## totalsulfurdioxide 8.206e-05 2.422e-05 3.388 0.000704 ***
## density -2.293e-01 2.084e-01 -1.101 0.271059
## ph -1.128e-02 8.191e-03 -1.378 0.168352
## sulphates -6.588e-03 5.906e-03 -1.115 0.264662
## alcohol 3.583e-03 1.500e-03 2.388 0.016935 *
## labelappeal-1 2.206e-01 4.128e-02 5.345 9.06e-08 ***
## labelappeal0 4.158e-01 4.021e-02 10.340 < 2e-16 ***
## labelappeal1 5.450e-01 4.090e-02 13.324 < 2e-16 ***
## labelappeal2 6.927e-01 4.609e-02 15.030 < 2e-16 ***
## acidindex -7.994e-02 4.996e-03 -16.001 < 2e-16 ***
## stars1 7.850e-01 2.123e-02 36.980 < 2e-16 ***
## stars2 1.096e+00 1.984e-02 55.224 < 2e-16 ***
## stars3 1.215e+00 2.088e-02 58.191 < 2e-16 ***
## stars4 1.330e+00 2.633e-02 50.521 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(40652.27) family taken to be 1)
##
## Null deviance: 19378 on 10831 degrees of freedom
## Residual deviance: 11477 on 10811 degrees of freedom
## AIC: 38542
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 40652
## Std. Err.: 36846
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -38498.03
| metric | value |
|---|---|
| AIC | 38542.03 |
| Dispersion | 0.88 |
| Log-Lik | -19249.02 |
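As expected, the Negative Binomial Regression does not outperform the Poisson. The estimated theta is extremely large (about 40,652, with the fit reaching its iteration limit), so the model effectively reduces to the Poisson: the coefficients are nearly identical and the AIC is marginally higher (38542.03 vs. 38539.67).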
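The role of theta is easiest to see in the variance function of the negative binomial as parameterized by glm.nb. The expected count of 3 used below is only an illustrative value, not a figure from the report:

$$
\operatorname{Var}(Y) = \mu + \frac{\mu^{2}}{\theta},
\qquad
\lim_{\theta \to \infty} \operatorname{Var}(Y) = \mu .
$$

With $\theta \approx 40652$ and $\mu = 3$, the extra variance term is $9 / 40652 \approx 0.0002$, which is negligible next to the Poisson variance of 3.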
We’ll build a Zero-Inflated Negative Binomial model to handle the
large number of zero values in our labelappeal and
stars predictors, to see if we can improve model
accuracy.
nb2 <- zeroinfl(target ~ . | ., data=df_train, dist = 'negbin')
##
## Call:
## zeroinfl(formula = target ~ . | ., data = df_train, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.275890 -0.428238 -0.001389 0.382527 5.195515
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.204e-01 2.238e-01 3.219 0.001288 **
## fixedacidity 1.885e-04 9.089e-04 0.207 0.835670
## volatileacidity -1.190e-02 7.304e-03 -1.629 0.103225
## citricacid 2.826e-03 6.529e-03 0.433 0.665139
## residualsugar -6.728e-05 1.671e-04 -0.403 0.687221
## chlorides -2.201e-02 1.782e-02 -1.235 0.216708
## freesulfurdioxide 1.553e-05 3.723e-05 0.417 0.676476
## totalsulfurdioxide -1.599e-05 2.418e-05 -0.662 0.508290
## density -2.167e-01 2.149e-01 -1.008 0.313280
## ph 4.418e-03 8.386e-03 0.527 0.598367
## sulphates 1.815e-03 6.061e-03 0.299 0.764627
## alcohol 6.170e-03 1.537e-03 4.016 5.93e-05 ***
## labelappeal-1 4.351e-01 4.489e-02 9.694 < 2e-16 ***
## labelappeal0 7.245e-01 4.383e-02 16.532 < 2e-16 ***
## labelappeal1 9.131e-01 4.456e-02 20.492 < 2e-16 ***
## labelappeal2 1.071e+00 4.947e-02 21.647 < 2e-16 ***
## acidindex -1.931e-02 5.336e-03 -3.619 0.000295 ***
## stars1 6.819e-02 2.299e-02 2.967 0.003010 **
## stars2 1.888e-01 2.152e-02 8.772 < 2e-16 ***
## stars3 2.870e-01 2.252e-02 12.742 < 2e-16 ***
## stars4 3.833e-01 2.777e-02 13.804 < 2e-16 ***
## Log(theta) 1.746e+01 NaN NaN NaN
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.808e+00 1.517e+00 -4.486 7.24e-06 ***
## fixedacidity 3.174e-03 5.903e-03 0.538 0.590857
## volatileacidity 2.159e-01 4.783e-02 4.514 6.38e-06 ***
## citricacid -1.453e-02 4.351e-02 -0.334 0.738496
## residualsugar -1.410e-03 1.096e-03 -1.286 0.198279
## chlorides 2.097e-02 1.168e-01 0.180 0.857452
## freesulfurdioxide -8.158e-04 2.531e-04 -3.223 0.001270 **
## totalsulfurdioxide -1.010e-03 1.629e-04 -6.201 5.60e-10 ***
## density 6.807e-01 1.425e+00 0.478 0.632890
## ph 1.955e-01 5.412e-02 3.613 0.000303 ***
## sulphates 8.649e-02 3.953e-02 2.188 0.028665 *
## alcohol 1.860e-02 1.009e-02 1.844 0.065249 .
## labelappeal-1 1.646e+00 3.857e-01 4.266 1.99e-05 ***
## labelappeal0 2.365e+00 3.833e-01 6.169 6.86e-10 ***
## labelappeal1 3.088e+00 3.890e-01 7.938 2.05e-15 ***
## labelappeal2 3.419e+00 4.426e-01 7.726 1.11e-14 ***
## acidindex 4.290e-01 2.869e-02 14.952 < 2e-16 ***
## stars1 -2.135e+00 8.357e-02 -25.548 < 2e-16 ***
## stars2 -5.679e+00 3.388e-01 -16.765 < 2e-16 ***
## stars3 -2.026e+01 3.704e+02 -0.055 0.956386
## stars4 -2.036e+01 6.939e+02 -0.029 0.976589
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 38155671.6535
## Number of iterations in BFGS optimization: 58
## Log-likelihood: -1.719e+04 on 43 Df
| metric | value |
|---|---|
| AIC | 34473.98 |
| Dispersion | 0.45 |
| Log-Lik | -17193.99 |
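The Zero-Inflated Negative Binomial model sees a similar improvement to the Zero-Inflated Poisson, but as before does not outperform its Poisson counterpart: the log-likelihood is identical (-17193.99) and the AIC is higher by 2 (34473.98 vs. 34471.98) because of the extra theta parameter.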
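The AIC values reported for the four count models can be reproduced from their log-likelihoods with AIC = 2k - 2 logLik. In the sketch below, the log-likelihoods are copied from the tables above; the parameter counts for the zero-inflated models come from the printed Df, while those for the Poisson (21) and negative binomial (22) fits are inferred from the coefficient tables and are therefore an assumption.

```r
# AIC from a log-likelihood and a parameter count.
aic_from_loglik <- function(loglik, k) 2 * k - 2 * loglik

models <- data.frame(
  model  = c("Poisson", "ZIP", "NegBin", "ZINB"),
  loglik = c(-19248.84, -17193.99, -19249.02, -17193.99),
  k      = c(21, 42, 22, 43)
)
models$aic <- aic_from_loglik(models$loglik, models$k)

# Sort from best (lowest AIC) to worst.
print(models[order(models$aic), ])
```

The two zero-inflated fits share the same log-likelihood, so their AIC values differ only by the penalty for the one extra parameter.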
For our first Multiple Linear Regression, we’ll use all predictors.
lm1 <- lm(target ~ ., data=df_train)
##
## Call:
## lm(formula = target ~ ., data = df_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0032 -0.8546 0.0105 0.8407 5.5618
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.775e+00 4.828e-01 5.748 9.26e-09 ***
## fixedacidity -6.975e-04 1.994e-03 -0.350 0.726459
## volatileacidity -1.045e-01 1.597e-02 -6.544 6.28e-11 ***
## citricacid 1.838e-02 1.455e-02 1.263 0.206660
## residualsugar 2.339e-04 3.684e-04 0.635 0.525561
## chlorides -9.922e-02 3.915e-02 -2.534 0.011286 *
## freesulfurdioxide 2.629e-04 8.373e-05 3.140 0.001695 **
## totalsulfurdioxide 2.373e-04 5.434e-05 4.367 1.27e-05 ***
## density -7.335e-01 4.707e-01 -1.558 0.119215
## ph -2.940e-02 1.841e-02 -1.597 0.110383
## sulphates -1.332e-02 1.329e-02 -1.002 0.316209
## alcohol 1.176e-02 3.370e-03 3.491 0.000483 ***
## labelappeal-1 3.389e-01 6.818e-02 4.970 6.79e-07 ***
## labelappeal0 8.168e-01 6.643e-02 12.295 < 2e-16 ***
## labelappeal1 1.270e+00 6.934e-02 18.310 < 2e-16 ***
## labelappeal2 1.899e+00 9.160e-02 20.729 < 2e-16 ***
## acidindex -1.993e-01 9.879e-03 -20.180 < 2e-16 ***
## stars1 1.398e+00 3.554e-02 39.320 < 2e-16 ***
## stars2 2.411e+00 3.457e-02 69.757 < 2e-16 ***
## stars3 2.974e+00 4.001e-02 74.339 < 2e-16 ***
## stars4 3.647e+00 6.357e-02 57.364 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.299 on 10811 degrees of freedom
## Multiple R-squared: 0.546, Adjusted R-squared: 0.5451
## F-statistic: 650 on 20 and 10811 DF, p-value: < 2.2e-16
| metric | value |
|---|---|
| AIC | 36431.79 |
| Adj R2 | 0.55 |
…
For our second Multiple Linear Regression, we’ll add stepwise feature selection.
lm2_all <- lm(target ~ ., data=df_train)
lm2 <- stepAIC(lm2_all, trace=FALSE, direction='both')
##
## Call:
## lm(formula = target ~ volatileacidity + chlorides + freesulfurdioxide +
## totalsulfurdioxide + density + ph + alcohol + labelappeal +
## acidindex + stars, data = df_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0357 -0.8560 0.0131 0.8396 5.5892
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.772e+00 4.825e-01 5.745 9.46e-09 ***
## volatileacidity -1.048e-01 1.597e-02 -6.566 5.42e-11 ***
## chlorides -1.000e-01 3.915e-02 -2.556 0.01061 *
## freesulfurdioxide 2.633e-04 8.368e-05 3.146 0.00166 **
## totalsulfurdioxide 2.390e-04 5.431e-05 4.400 1.09e-05 ***
## density -7.368e-01 4.706e-01 -1.566 0.11742
## ph -2.922e-02 1.841e-02 -1.587 0.11249
## alcohol 1.178e-02 3.368e-03 3.499 0.00047 ***
## labelappeal-1 3.392e-01 6.818e-02 4.976 6.59e-07 ***
## labelappeal0 8.168e-01 6.642e-02 12.298 < 2e-16 ***
## labelappeal1 1.270e+00 6.932e-02 18.319 < 2e-16 ***
## labelappeal2 1.899e+00 9.160e-02 20.733 < 2e-16 ***
## acidindex -1.994e-01 9.699e-03 -20.564 < 2e-16 ***
## stars1 1.398e+00 3.554e-02 39.354 < 2e-16 ***
## stars2 2.413e+00 3.455e-02 69.840 < 2e-16 ***
## stars3 2.976e+00 3.999e-02 74.407 < 2e-16 ***
## stars4 3.649e+00 6.356e-02 57.411 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.299 on 10815 degrees of freedom
## Multiple R-squared: 0.5458, Adjusted R-squared: 0.5452
## F-statistic: 812.3 on 16 and 10815 DF, p-value: < 2.2e-16
| metric | value |
|---|---|
| AIC | 36426.96 |
| Adj R2 | 0.55 |
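To see how AIC-based stepwise selection behaves, here is a small sketch on simulated data. It uses base R’s `step()`, a close analogue of the `stepAIC()` call above, swapped in so the example needs no extra packages. The data, the seed, and the variable names are made up.

```r
set.seed(99)
n <- 300

# Two informative predictors and two pure-noise predictors.
sim <- data.frame(
  x1 = rnorm(n), x2 = rnorm(n),
  noise1 = rnorm(n), noise2 = rnorm(n)
)
sim$y <- 2 + 1.5 * sim$x1 - 1.0 * sim$x2 + rnorm(n)

full    <- lm(y ~ ., data = sim)
reduced <- step(full, direction = "both", trace = 0)

# Compare the retained terms and the AIC of both models.
print(formula(reduced))
print(c(full = AIC(full), reduced = AIC(reduced)))
```

As in the report, dropping weak predictors tends to change the fit only slightly, so the gain in AIC is usually modest.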
…
‘Total Sulfur Dioxide – Why it Matters, Too!’
Iowa State University
https://www.extension.iastate.edu/wine/total-sulfur-dioxide-why-it-matters-too/